Implementation of Lexical Analysis



Specifying lexical structure using regular 
expressions 



Outline 


Finite automata 

Deterministic Finite Automata (DFAs) 

Non-deterministic Finite Automata 
(NFAs) 




Implementation of regular expressions 

RegExp => NFA => DFA => Tables

How are regular expressions used to construct a full lexical
specification of a programming language?



Notation 



At least one: A+ ≡ AA*

Union: A | B ≡ A + B
Option: A? ≡ A + ε
Range: 'a'+'b'+...+'z' ≡ [a-z]

Excluded range:
complement of [a-z] ≡ [^a-z]
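
The shorthand above can be spot-checked with Python's `re` module. This is only a sketch under stated assumptions: Python writes union as `|` rather than `+`, the helper `same_language` and the sample strings are made up for illustration, and agreeing on a handful of samples does not prove two expressions equivalent.

```python
import re

def same_language(pattern_a, pattern_b, samples):
    """Return True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

# Arbitrary sample strings chosen for illustration.
samples = ["", "a", "aa", "b", "ab", "z", "A", "5"]

# At least one: A+ versus AA*
at_least_one = same_language(r"a+", r"aa*", samples)

# Option: A? versus A + epsilon (an empty alternative)
option = same_language(r"a?", r"a|", samples)

# Range: [a-z] versus 'a' + 'b' + ... + 'z'
letters = "|".join("abcdefghijklmnopqrstuvwxyz")
char_range = same_language(r"[a-z]", letters, samples)

# Excluded range: [^a-z] accepts one character that is not in a-z
excluded = [s for s in samples if re.fullmatch(r"[^a-z]", s)]
```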
Regular Expressions in Lexical Specification



A specification for the predicate

s ∈ L(R)

What do we want to do?

We want to know whether a given string is in
this language or not.

But a yes/no answer is not enough! 

Instead: partition the input string into tokens. 
Each one of these tokens is in L(R). 

We adapt regular expressions to this goal 
Regular Expressions => Lexical Spec. (1)



Steps to construct a full lexical specification 
from regular expressions 

1. Write a rexp for the lexemes of each token

• Number = digit+

• Keyword = 'if' + 'else' + ...

• Identifier = letter (letter + digit)*

• OpenPar = '('
Regular Expressions => Lexical Spec. (2)



2. Construct R, matching all lexemes for all tokens

R = Keyword + Identifier + Number + ...
  = R_1 + R_2 + ...

The union of all these regular expressions
forms the lexical specification of the language.
Regular Expressions => Lexical Spec. (3)



3. Let input be x_1...x_n
For 1 ≤ i ≤ n check

x_1...x_i ∈ L(R)

4. If success, then we know that

x_1...x_i ∈ L(R_j) for some j

5. Remove x_1...x_i from input and go to (3)



Repeat until the input string is empty (the entire input is analyzed) 
Ambiguities (1)

There are ambiguities in the algorithm

How much input is used? What if

• x_1...x_i ∈ L(R) and also
• x_1...x_k ∈ L(R)



Rule: Pick longest possible string in L(R) 
The "maximal munch" 
Ambiguities (2)

Which token is used?
What if

• x_1...x_i ∈ L(R_j)
and also
• x_1...x_i ∈ L(R_k)

Rule: use rule listed first (j if j < k)

Treats "if" as a keyword, not an identifier 
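
Both rules, maximal munch and "first listed rule wins", can be combined in one small loop. The following is a minimal sketch, not a production lexer: the `Whitespace` rule, the function name `tokenize`, and the concrete character classes for letter and digit are assumptions added for illustration. It also relies on each individual Python pattern returning its own longest match, which holds for these particular rules.

```python
import re

# Rules in priority order: earlier rules win ties.
# Whitespace is an assumed extra rule so that spaces can be consumed.
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
    ("Whitespace", re.compile(r"[ \t\n]+")),
]

def tokenize(text):
    """Partition text into (token_class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match replaces the current best (maximal munch);
            # on a tie the earlier rule is kept (rule priority).
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With these rules, `tokenize("if")` yields a Keyword because Keyword is listed before Identifier, while `tokenize("ifx")` yields a single Identifier because the longer match wins.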
Error Handling

• What if no rule matches a prefix of the input?

• Problem: Can't just get stuck... Compilers should
do good error handling.

• Solution:

Write a rule matching all "bad" strings
ERROR = [all strings not in the lexical specification]
Put it last (lowest priority)
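
One common way to approximate the ERROR rule is a catch-all placed last. The sketch below makes a simplifying assumption that is not on the slide: ERROR matches one arbitrary character rather than a whole "bad" string. The rule set and the name `tokenize_with_errors` are hypothetical.

```python
import re

# Hypothetical rule set; ERROR is listed last so it has the lowest priority.
RULES = [
    ("Number", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    # Simplification: one arbitrary character stands in for a "bad" string.
    ("ERROR", re.compile(r".", re.DOTALL)),
]

def tokenize_with_errors(text):
    """Longest match, first rule on ties; never gets stuck on bad input."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        # ERROR matches any single character, so best is never None here.
        name, end = best
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens
```

Because ERROR is last, it only wins when no other rule matches at least one character at the current position.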
Summary



Regular expressions provide a concise notation 
for string patterns 



Use in lexical analysis requires small extensions 
To resolve ambiguities 
To handle errors 



Good algorithms known 

Require only single pass over the input 
Few operations per character (table lookup) 
Finite Automata

Regular expressions = specification

Finite automata = implementation
mechanism for regular expressions

A finite automaton consists of
An input alphabet Σ
A finite set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions state --input--> state
Finite Automata

Transition

s_1 --a--> s_2

Is read

In state s_1 on input "a" go to state s_2

If end of input and in accepting state => accept

Otherwise => reject
Finite Automata State Graphs



A state 

The start state 





An accepting state 




A transition 
A Simple Example

A finite automaton that accepts only "1"
Another Simple Example

A finite automaton accepting any number of 1's
followed by a single 0

Alphabet: {0,1} 
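
A deterministic automaton like this one can be simulated in a few lines. This is a minimal sketch: the state names "A" and "B", the dictionary encoding, and the function name `dfa_accepts` are assumptions, and a missing transition is treated as rejection.

```python
def dfa_accepts(transitions, start, accepting, text):
    """Run a DFA given as a dict {(state, symbol): next_state}."""
    state = start
    for symbol in text:
        if (state, symbol) not in transitions:
            return False  # assumed convention: no transition possible => reject
        state = transitions[(state, symbol)]
    # Accept only if the input ends while in an accepting state.
    return state in accepting

# Assumed state names: "A" loops on 1, and a single 0 leads to accepting "B".
TRANSITIONS = {("A", "1"): "A", ("A", "0"): "B"}
START = "A"
ACCEPTING = {"B"}
```

For example, `dfa_accepts(TRANSITIONS, START, ACCEPTING, "110")` accepts, while "1" and "100" are rejected.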
Another Example

Alphabet {0,1}

What language does this recognize?
Epsilon Moves

Another kind of transition: ε-moves

• Machine can move from state A to state B
without reading input

• Think of ε-moves as a kind of free move.
Deterministic and Non-deterministic Automata

Deterministic Finite Automata (DFA)
One transition per input per state
No ε-moves

Nondeterministic Finite Automata (NFA)

Can have multiple transitions for one input
in a given state

Can have ε-moves
Execution of Finite Automata



A DFA can take only one path through the 
state graph 

The path is completely determined by 
the input (The machine has no choice) 



NFAs can choose 

Whether to make ε-moves

Which of multiple transitions for a single 
input to take 
Acceptance of NFAs



An NFA can get into multiple states 




Rule: NFA accepts if it can get to a final state 
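
The acceptance rule can be made concrete by tracking the set of states the NFA could be in after each input symbol. The sketch below is one possible encoding, not the only one: the dictionary layout, the helper names, and the example NFA (strings over {0,1} that end in 1) are assumptions for illustration.

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using only epsilon-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, eps, start, accepting, text):
    """delta: {(state, symbol): set of states}; eps: {state: set of states}."""
    current = epsilon_closure({start}, eps)
    for symbol in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, symbol), set())
        current = epsilon_closure(moved, eps)
    # Accept if at least one of the possible states is a final state.
    return bool(current & accepting)

# Hypothetical NFA: in "p" on input 1 it may stay in "p" or move to final "q".
DELTA = {("p", "0"): {"p"}, ("p", "1"): {"p", "q"}}
EPS = {}
```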
NFA vs. DFA (1)

NFAs and DFAs recognize the same set of
languages (regular languages)

DFAs are faster to execute 

There are no choices to consider 
NFA vs. DFA (2)

For a given language, an NFA can be simpler than
a DFA

[Figure: state graphs of an NFA and a DFA]

A DFA can be exponentially larger than an NFA



Dr. Sherin ElGokhy 



Regular Expressions to Finite Automata

High-level sketch

Lexical Specification => Regular expressions => NFA => DFA =>
Table-driven Implementation of DFA






Regular Expressions to NFA (1)

For each kind of rexp, define an NFA
Notation: NFA for rexp M

• For ε

• For input a
Regular Expressions to NFA (2)

• For AB

• For A + B
Regular Expressions to NFA (3)
Example of RegExp -> NFA conversion

Consider the regular expression

(1+0)*1

The NFA is
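
Since the figures are not reproduced here, the following sketch builds an NFA for (1+0)*1 with the textbook Thompson-style construction, which may differ in detail from the diagrams on the slides. All names (`NFA`, `epsilon`, `symbol`, `concat`, `union`, `star`, `accepts`) are hypothetical, and ε-moves are encoded as edges labeled `None`.

```python
import itertools

_ids = itertools.count()  # fresh state numbers

class NFA:
    """Fragment with one start and one accepting state.
    edges: list of (source, label, target); label None is an epsilon-move."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def epsilon():
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, None, f)])

def symbol(a):
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, a, f)])

def concat(m, n):
    # AB: the accepting state of A gets an epsilon-move to the start of B.
    return NFA(m.start, n.accept, m.edges + n.edges + [(m.accept, None, n.start)])

def union(m, n):
    # A + B: new start branches to both; both accepting states join a new one.
    s, f = next(_ids), next(_ids)
    extra = [(s, None, m.start), (s, None, n.start),
             (m.accept, None, f), (n.accept, None, f)]
    return NFA(s, f, m.edges + n.edges + extra)

def star(m):
    # A*: skip A entirely, or run A and optionally loop back to its start.
    s, f = next(_ids), next(_ids)
    extra = [(s, None, m.start), (s, None, f),
             (m.accept, None, m.start), (m.accept, None, f)]
    return NFA(s, f, m.edges + extra)

def accepts(nfa, text):
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for (src, label, dst) in nfa.edges:
                if src == q and label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({nfa.start})
    for ch in text:
        current = closure({dst for (src, label, dst) in nfa.edges
                           if src in current and label == ch})
    return nfa.accept in current

# (1+0)*1
example = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
```

The resulting NFA accepts exactly the strings over {0,1} that end in 1, for instance "1" and "0101" but not "10".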
NFA to DFA

Choose the NFA that accepts the following
regular expression: 1* + 0
NFA to DFA: The Trick

Simulate the NFA
Each state of DFA

= a non-empty subset of states of the NFA
Start state

= the set of NFA states reachable through
ε-moves from NFA start state

Add a transition S --a--> S' to DFA iff

S' is the set of NFA states reachable from
any state in S after seeing the input a,
considering ε-moves as well
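
The trick above is usually called the subset construction. Below is a minimal sketch of it; the dictionary encoding, the function names, and the example NFA are assumptions. It also uses the usual rule, not spelled out above, that a DFA state is accepting when it contains at least one accepting NFA state.

```python
def epsilon_closure(states, eps):
    """States reachable from `states` through epsilon-moves, as a frozenset."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(delta, eps, start, accepting, alphabet):
    """Return (dfa_transitions, dfa_start, dfa_accepting).
    DFA states are frozensets of NFA states."""
    dfa_start = epsilon_closure({start}, eps)
    transitions, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        S = todo.pop()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            target = epsilon_closure(moved, eps)
            if not target:
                continue  # only non-empty subsets become DFA states
            transitions[(S, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # Assumed standard rule: accepting if any member is an accepting NFA state.
    dfa_accepting = {S for S in seen if S & accepting}
    return transitions, dfa_start, dfa_accepting

# Hypothetical NFA for strings over {0,1} that end in 1.
DELTA = {("p", "0"): {"p"}, ("p", "1"): {"p", "q"}}
trans, dfa_start, dfa_accepting = nfa_to_dfa(DELTA, {}, "p", {"q"}, "01")
```

Only subsets that are actually reachable from the start state are generated, so in practice the DFA is often much smaller than the worst case.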
NFA to DFA. Remark

An NFA may be in many states at any time

How many different states?

If there are N states, the NFA must be in some
subset of those N states

How many subsets are there?
• 2^N - 1 = finitely many
DFA Example

[Figure: DFA state graph; one state is labeled ABCDHI]







Choose the DFA that represents the same
language as the given NFA



Implementation 



A DFA can be implemented by a 2D table T 
One dimension is "states" 

Other dimension is "input symbol" 

For every transition S_i --a--> S_k define T[i,a] = k

DFA "execution"

If in state S_i and input a, read T[i,a] = k and
skip to state S_k

Very efficient 
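
A table-driven DFA can be sketched as a list of rows indexed by state, with one column per input symbol. The example encodes "any number of 1's followed by a single 0"; the explicit dead state, the state numbering, and the names `T`, `SYMBOL_INDEX`, and `run_table` are assumptions for illustration.

```python
# Input symbols are mapped to column indices; states are row indices.
SYMBOL_INDEX = {"0": 0, "1": 1}
DEAD = 2  # assumed extra "dead" state so that every table entry is defined

# state 0 = start, state 1 = accepting, state 2 = dead
T = [
    [1, 0],        # state 0: on 0 go to state 1, on 1 stay in state 0
    [DEAD, DEAD],  # state 1: any further input leads to the dead state
    [DEAD, DEAD],  # state 2: the dead state is never left
]
ACCEPTING = {1}

def run_table(text, start=0):
    """Execute the DFA with one table lookup per input character."""
    state = start
    for ch in text:
        state = T[state][SYMBOL_INDEX[ch]]
    return state in ACCEPTING
```

Characters outside the alphabet raise a `KeyError` in this sketch; a real scanner would map them to an error column instead.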
Implementation (Cont.)

NFA -> DFA conversion is at the heart of tools
such as flex

But, DFAs can be huge



In practice, flex-like tools trade off speed for 
space in the choice of NFA and DFA 
representations 



DFA: faster, less compact

NFA: slower, consumes less memory 